Abstract



We consider the problem of computing the q-gram profile of a string T of size N compressed by a context-free grammar with n production rules. We present an algorithm that runs in $O(N - \alpha)$ expected time and uses $O(n + q + k_{T,q})$ space, where $N - \alpha \le qn$ is the exact number of characters decompressed by the algorithm and $k_{T,q} \le N - \alpha$ is the number of distinct q-grams in T. This simultaneously matches the current best known time bound and improves the best known space bound. Our space bound is asymptotically optimal in the sense that any algorithm storing the grammar and the q-gram profile must use $\Omega(n + q + k_{T,q})$ space. To achieve this we introduce the q-gram graph that space-efficiently captures the structure of a string with respect to its q-grams, and show how to construct it from a grammar.

1 Introduction

Given a string T, the q-gram profile of T is a data structure that can answer substring frequency queries for substrings of length q (q-grams) in $O(q)$ time. We study the problem of computing the q-gram profile from a string T of size N compressed by a context-free grammar with n production rules.



The generalization of string algorithms to grammar-based compressed text is currently an active area of research. Grammar-based compression is studied because it offers a simple and strict setting and is capable of modelling many commonly used compression schemes, such as those in the Lempel-Ziv family [20, 21], with little expansion [2, 14]. The problem of computing the q-gram profile has applications in bioinformatics, data mining, and machine learning [4, 11, 13]. All are fields where handling large amounts of data effectively is crucial. Also, the q-gram distance can be computed from the q-gram profiles of two strings and used for filtering in string matching [1, 8, 16, 17, 18, 19].

Recently the first dedicated solution to computing the q-gram profile from a grammar-based compressed string was proposed by Goto et al. [7]. Their algorithm runs in $O(qn)$ expected time¹ and uses $O(qn)$ space. This was later improved by the same authors [6] to an algorithm that takes $O(N - \alpha)$ expected time and uses $O(N - \alpha)$ space, where N is the size of the uncompressed string, and $\alpha$ is a parameter depending on how well T is compressed with respect to its q-grams. $N - \alpha \le \min(qn, N)$ is in fact the exact number of characters decompressed by the algorithm in order to compute the q-gram profile, meaning that the latter algorithm excels in avoiding decompressing the same character more than once. These algorithms, as well as the one presented in this paper, assume the RAM model of computation with a word size of $\log N$ bits.

* An extended abstract of this paper appeared at the 24th Conference on Combinatorial Pattern Matching.
¹ The bound in [7] is stated as worst-case since they assume integer alphabets for fast suffix sorting. We make no such assumption, and without it hashing can be used to obtain the same bound in expectation.

We present a Las Vegas-type randomized algorithm that gives Theorem 1.

Theorem 1 Let T be a string of size N compressed by a grammar of size n. The q-gram profile can be computed in $O(N - \alpha)$ expected time and $O(n + q + k_{T,q})$ space, where $k_{T,q} \le N - \alpha$ is the number of distinct q-grams in T.

Hence, our algorithm simultaneously matches the current best known time bound and improves the best known space bound. Our space bound is asymptotically optimal in the sense that any algorithm storing the grammar and the q-gram profile must use $\Omega(n + q + k_{T,q})$ space.

A straightforward approach to computing the q-gram profile is to first decompress the string and then use an algorithm for computing the profile from a string. For instance, we could construct a compact trie of the q-grams using an algorithm similar to a suffix tree construction algorithm as mentioned in [9], or use Rabin-Karp fingerprints to obtain a randomized algorithm [19]. However, both approaches are impractical because the time and space usage associated with a complete decompression of T is linear in its size $N = O(2^n)$. To achieve our bounds we introduce the q-gram graph, a data structure that space-efficiently captures the structure of a string in terms of its q-grams, and show how to compute the graph from a grammar. We then transform the graph to a suffix tree containing the q-grams of T. Because our algorithm uses randomization to construct the q-gram graph, the answer to a query may be incorrect. However, as a final step of our algorithm, we show how to use the suffix tree to verify that the fingerprint function is collision free and thereby obtain Theorem 1.

2 Preliminaries and Notation 

2.1 Strings and Suffix Trees

Let T be a string of length $|T|$ consisting of characters from the alphabet $\Sigma$. We use $T[i : j]$, $0 \le i \le j < |T|$, to denote the substring starting in position i of T and ending in position j of T. We define $socc(s, T)$ to be the number of occurrences of the string s in T.
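
As a point of reference for these definitions, the following is a minimal Python sketch of the straightforward approach on an uncompressed string. The function names `socc` and `naive_profile` are chosen here for illustration, and a hash-based `Counter` stands in for the q-gram profile; it needs the full text in memory, which is exactly what the compressed algorithm avoids.

```python
from collections import Counter


def socc(s: str, t: str) -> int:
    """Count the (possibly overlapping) occurrences of s in t."""
    return sum(1 for i in range(len(t) - len(s) + 1) if t[i:i + len(s)] == s)


def naive_profile(t: str, q: int) -> Counter:
    """Map every q-gram of t to its frequency by scanning the plain text."""
    return Counter(t[i:i + q] for i in range(len(t) - q + 1))


if __name__ == "__main__":
    profile = naive_profile("ababbbab", 3)
    # Both values are the frequency of the 3-gram "bab" in the text.
    print(profile["bab"], socc("bab", "ababbbab"))
```

The scan takes time and space linear in the uncompressed length, so it is only useful as a baseline for checking results on small inputs.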

The suffix tree of T is a compact trie containing all suffixes of T. That is, it is a trie containing the strings $T[i : |T| - 1]$ for $i = 0..|T| - 1$. The suffix tree of T can be constructed in $O(|T|)$ time and uses $O(|T|)$ space for integer alphabets [3]. The generalized suffix tree is the suffix tree for a set of strings. It can be constructed using time and space linear in the sum of the lengths of the strings in the set. The set of strings may be compactly represented as a common suffix tree (CS-tree). The CS-tree has the characters of the strings on its edges, and the strings start in the leaves and end in the root. If two strings have some suffix in common, the suffixes are merged to one path. In other words, the CS-tree is a trie of the reversed strings, and is not to be confused with the suffix tree. For CS-trees, the following is known.

Lemma 1 (Shibuya [15]) Given a set of strings represented by a CS-tree of size n and comprised of characters from an integer alphabet, the generalized suffix tree of the set of strings can be constructed in $O(n)$ time using $O(n)$ space.

For a node v in a suffix tree, the string depth $sd(v)$ is the sum of the lengths of the labels on the edges from the root to v. We use $parent(v)$ to get the parent of v, and $nca(v, u)$ is the nearest common ancestor of the nodes v and u.



2.2 Straight Line Programs 

A Straight Line Program (SLP) is a context-free grammar in Chomsky normal form that unambiguously derives a string T of length N over the alphabet $\Sigma$. In other words, an SLP S is a set of n production rules of the form $X_i = X_l X_r$ or $X_i = a$, where a is a character from the alphabet $\Sigma$, and each rule is reachable from the start rule $X_n$. Our algorithm assumes without loss of generality that the compressed string given as input is compressed by an SLP.

It is convenient to view an SLP as a directed acyclic graph (DAG) in which each node represents a production rule. Consequently, each node representing a nonterminal rule has exactly two outgoing edges. An example of an SLP is seen in Figure 2(a). When a string is decompressed we get a derivation tree which corresponds to the depth-first traversal of the DAG.

We denote by $t_{X_i}$ the string derived from production rule $X_i$, so $T = t_{X_n}$. For convenience we say that $|X_i|$ is the length of the string derived from $X_i$, and these values can be computed in linear time in a bottom-up fashion using the following recursion. For each $X_i$ in S,

$$|X_i| = \begin{cases} |X_l| + |X_r| & \text{if } X_i = X_l X_r \text{ is a nonterminal,} \\ 1 & \text{otherwise.} \end{cases}$$



Finally, we denote by $occ(X_i)$ the number of times the production rule $X_i$ occurs in the derivation tree. We can compute the occurrences using the following linear time and space algorithm due to Goto et al. [7]. Set $occ(X_n) = 1$ for the start rule and $occ(X_i) = 0$ for $i = 1..n-1$. For each production rule of the form $X_i = X_l X_r$, in decreasing order of i (assuming the rules are numbered such that $l, r < i$), we set $occ(X_l) = occ(X_l) + occ(X_i)$ and similarly for $occ(X_r)$.
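
The two linear-time passes can be sketched in Python as follows. The list encoding of the SLP (a character for a terminal rule, a pair of smaller indices for a nonterminal rule, start rule last) and the rules in `EXAMPLE` are assumptions made for this illustration; `EXAMPLE` is one possible SLP for the string ababbbab and not necessarily the one drawn in the paper's figure. The sketch starts the counts at 1 for the start rule and 0 for every other rule, which is what makes them agree with the derivation tree.

```python
def slp_lengths_and_occ(rules):
    """Return (length, occ) for an SLP given as a list of rules.

    rules[i] is either a single character (terminal rule) or a pair (l, r)
    of indices smaller than i (nonterminal rule). The last rule is the
    start rule. This encoding is an assumption of the sketch.
    """
    n = len(rules)

    # Bottom-up pass: children always precede their parents in the list.
    length = [0] * n
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            length[i] = 1
        else:
            l, r = rule
            length[i] = length[l] + length[r]

    # Top-down pass: the start rule occurs once, every rule passes its
    # count on to both of its children.
    occ = [0] * n
    occ[n - 1] = 1
    for i in range(n - 1, -1, -1):
        if not isinstance(rules[i], str):
            l, r = rules[i]
            occ[l] += occ[i]
            occ[r] += occ[i]
    return length, occ


# Hypothetical SLP for "ababbbab": X1=a, X2=b, X3=X1X2, X4=X2X2,
# X5=X3X3, X6=X4X3, X7=X5X6 (indices below are zero-based).
EXAMPLE = ["a", "b", (0, 1), (1, 1), (2, 2), (3, 2), (4, 5)]

if __name__ == "__main__":
    print(slp_lengths_and_occ(EXAMPLE))
```

For the terminal rules the resulting counts equal the number of a's and b's in the derived string, which gives a quick sanity check.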

2.3 Fingerprints 

A Rabin-Karp fingerprint function $\phi$ takes a string as input and produces a value small enough to let us determine with high probability whether two strings match in constant time. Let s be a substring of T, c be some constant, $2N^{c+4} < p \le 4N^{c+4}$ be a prime, and choose $b \in \mathbb{Z}_p$ uniformly at random. Then,

$$\phi(s) = \sum_{k=1}^{|s|} s[k] \cdot b^k \bmod p.$$

Lemma 2 (Rabin and Karp [10]) Let $\phi$ be defined as above. Then, for all $0 \le i, j \le |T| - q$,

$$\phi(T[i : i + q]) = \phi(T[j : j + q]) \text{ iff } T[i : i + q] = T[j : j + q] \text{ w.h.p.}$$

We call the case when $T[i : i + q] \ne T[j : j + q]$ and $\phi(T[i : i + q]) = \phi(T[j : j + q])$ for some i and j a collision, and say that $\phi$ is collision free on substrings of length q in T if $\phi(T[i : i + q]) = \phi(T[j : j + q])$ iff $T[i : i + q] = T[j : j + q]$ for all i and j, $0 \le i, j \le |T| - q$.

Besides Lemma 2, fingerprints exhibit the useful property that once we have computed $\phi(T[i : i + q])$ we can compute the fingerprint $\phi(T[i + 1 : i + q + 1])$ in constant time using the update function,

$$\phi(T[i + 1 : i + q + 1]) = \phi(T[i : i + q])/b - T[i] + T[i + q + 1] \cdot b^q \bmod p.$$
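
To make the constant-time update concrete, here is a small Python sketch that computes the fingerprints of all q-grams of a plain string with one update per position. It uses Python's half-open slices, so a window is `t[i:i + q]` and the incoming character is `t[i + q]`. The fixed Mersenne prime `P`, the helper `random_base`, and the use of `ord` to map characters to integers are assumptions of the sketch; the text above instead picks the prime from a range that depends on N. Division by b is carried out as multiplication with the modular inverse, which `pow(b, -1, p)` provides from Python 3.8 on.

```python
import random

P = 2**61 - 1  # a Mersenne prime; a stand-in for the prime p in the text


def random_base(p: int = P) -> int:
    """Pick a nonzero base uniformly at random so that it is invertible mod p."""
    return random.randrange(1, p)


def fingerprint(s: str, b: int, p: int) -> int:
    """phi(s) = sum over k = 1..|s| of s[k] * b^k mod p (s[k] is 1-indexed)."""
    return sum(ord(ch) * pow(b, k, p) for k, ch in enumerate(s, start=1)) % p


def rolling_fingerprints(t: str, q: int, b: int, p: int):
    """Yield phi of every q-gram of t, left to right, with O(1) work per step."""
    if len(t) < q:
        return
    b_inv = pow(b, -1, p)   # modular inverse of b, used for the division by b
    b_q = pow(b, q, p)      # weight of the character entering the window
    f = fingerprint(t[:q], b, p)
    yield f
    for i in range(len(t) - q):
        # Drop t[i] (weight b), shift the remaining weights down by one
        # power of b, then add the new last character with weight b^q.
        f = ((f - ord(t[i]) * b) * b_inv + ord(t[i + q]) * b_q) % p
        yield f
```

The expression `(f - ord(t[i]) * b) * b_inv` equals `f / b - ord(t[i])` modulo p, so the loop body is the update function written with an explicit inverse.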



3 Key Concepts 

3.1 Relevant Substrings 

Consider a production rule $X_i = X_l X_r$ that derives the string $t_{X_i} = t_{X_l} t_{X_r}$. Assume that we have counted the number of occurrences of q-grams in $t_{X_l}$ and $t_{X_r}$ separately. Then the relevant substring $r_{X_i}$ is the smallest substring of $t_{X_i}$ that is necessary and sufficient to process in order to detect and count q-grams that have not already been counted. In other words, $r_{X_i}$ is the substring that contains the q-grams that start in $t_{X_l}$ and end in $t_{X_r}$, as shown in Figure 1. Formally, for a production rule $X_i = X_l X_r$, the relevant substring is $r_{X_i} = t_{X_i}[\max(0, |X_l| - q + 1) : \min(|X_l| + q - 2, |X_i| - 1)]$. We want the relevant substrings to contain at least one q-gram, so we say that a production rule $X_i$ only has a relevant substring if $|X_i| \ge q$. From this definition we see that the size of a relevant substring is $q \le |r_{X_i}| \le 2(q - 1)$.



Figure 1: The derivation tree for $X_i = X_l X_r$ and the relevant substring $r_{X_i}$ of $X_i$.

The concept of relevant substrings is the backbone of our algorithm because of the following. If $X_i$ occurs $occ(X_i)$ times in the derivation tree for S, then the substring $t_{X_i}$ occurs at least $occ(X_i)$ times in T. It follows that if a q-gram s occurs $socc(s, t_{X_i})$ times in some substring $t_{X_i}$ then we know that it occurs at least $socc(s, t_{X_i}) \cdot occ(X_i)$ times in T. Using our description of relevant substrings we can rewrite the latter statement to $socc(s, t_{X_i}) \cdot occ(X_i) = socc(s, t_{X_l}) \cdot occ(X_l) + socc(s, t_{X_r}) \cdot occ(X_r) + socc(s, r_{X_i}) \cdot occ(X_i)$ for the production rule $X_i = X_l X_r$. By applying this recursively to the root $X_n$ of the SLP we get the following Lemma, which is implicit in [7].

Lemma 3 Let $S_q = \{X_i \mid X_i \in S \text{ and } |X_i| \ge q\}$ be the set of production rules that have a relevant substring, and let s be some q-gram. Then,

$$socc(s, T) = \sum_{X_i \in S_q} socc(s, r_{X_i}) \cdot occ(X_i).$$
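
Lemma 3 can be checked on small inputs with the following Python sketch, which sums the q-grams of every relevant substring weighted by the occurrence count of its rule. The list encoding of the SLP, the helper `expand`, and the rules in `EXAMPLE` (a hypothetical SLP for ababbbab) are assumptions of this illustration. For brevity the sketch obtains each relevant substring by fully expanding both children, which a real implementation must avoid, and it assumes q is at least 2.

```python
from collections import Counter


def expand(rules, i):
    """Fully decompress rule i (for illustration only; exponential in general)."""
    rule = rules[i]
    if isinstance(rule, str):
        return rule
    return expand(rules, rule[0]) + expand(rules, rule[1])


def profile_from_relevant_substrings(rules, q):
    """Sum socc(s, r_Xi) * occ(Xi) over all rules with |Xi| >= q (q >= 2)."""
    n = len(rules)
    occ = [0] * n
    occ[-1] = 1
    for i in range(n - 1, -1, -1):
        if not isinstance(rules[i], str):
            for child in rules[i]:
                occ[child] += occ[i]

    profile = Counter()
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            continue
        left, right = expand(rules, rule[0]), expand(rules, rule[1])
        if len(left) + len(right) < q:
            continue  # the rule has no relevant substring
        # Last q-1 characters of the left child, first q-1 of the right child.
        r = left[max(0, len(left) - (q - 1)):] + right[:q - 1]
        for k in range(len(r) - q + 1):
            profile[r[k:k + q]] += occ[i]
    return profile


# Hypothetical SLP for "ababbbab" (zero-based indices, start rule last).
EXAMPLE = ["a", "b", (0, 1), (1, 1), (2, 2), (3, 2), (4, 5)]
```

On `EXAMPLE` with q = 3 only the three rules deriving strings of length at least 3 contribute, with relevant substrings abab, bbab and abbb.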

3.2 Prefix and Suffix Decompression

The following Lemma states a result that is crucial to the algorithm presented in this paper.

Lemma 4 (Gąsieniec et al. [5]) An SLP S of size n can be preprocessed in $O(n)$ time using $O(n)$ extra space such that, given a pointer to a variable $X_i$ in S, the j length prefix and suffix of $t_{X_i}$ can be decompressed in $O(j)$ time.

Gąsieniec et al. give a data structure that supports linear time decompression of prefixes, but it is easy to extend the result to also hold for suffixes. Let s be some string and $s^R$ the reversed string. If we reverse the prefix of length j of $s^R$, this corresponds to the suffix of length j of s. To obtain an SLP for the reversed string we swap the two variables on the right-hand side of each nonterminal production rule. The reversed SLP S' contains n production rules and the transformation ensures that $t_{X'_i} = t_{X_i}^R$ for each production rule $X'_i$ in S'. A proof of this can be found in [12]. Producing the reversed SLP takes linear time and in the process we create pointers from each variable to its corresponding variable in the reversed SLP. After both SLPs are preprocessed for linear time prefix decompression, a query for the j length suffix of $t_{X_i}$ is handled by following the pointer from $X_i$ to its counterpart in the reversed SLP, decompressing the j length prefix of this, and reversing the prefix.
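
The reduction from suffix to prefix decompression can be illustrated with a few lines of Python. The list encoding of the SLP and the names `reverse_slp`, `expand` and `suffix` are assumptions of this sketch, and `expand` is a plain recursive decompression rather than the linear-time prefix structure of Lemma 4; only the reversal idea is shown. Because rule i of the reversed SLP corresponds to rule i of the original, the pointer between the two is simply the shared index.

```python
def expand(rules, i):
    """Fully decompress rule i (stand-in for the prefix data structure)."""
    rule = rules[i]
    if isinstance(rule, str):
        return rule
    return expand(rules, rule[0]) + expand(rules, rule[1])


def reverse_slp(rules):
    """Swap the two children of every nonterminal rule."""
    return [rule if isinstance(rule, str) else (rule[1], rule[0]) for rule in rules]


def suffix(rules, reversed_rules, i, j):
    """j-length suffix of t_Xi: reverse the j-length prefix in the reversed SLP."""
    return expand(reversed_rules, i)[:j][::-1]


# Hypothetical SLP for "ababbbab" (zero-based indices, start rule last).
EXAMPLE = ["a", "b", (0, 1), (1, 1), (2, 2), (3, 2), (4, 5)]
```

With the real data structure, the call to `expand(...)[:j]` would be replaced by an $O(j)$ prefix query on the reversed SLP.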

3.3 The q-gram Graph 

We now describe a data structure that we call the q-gram graph. It too will play an important role in our algorithm. The q-gram graph $G_q(T)$ captures the structure of a string T in terms of its q-grams. In fact, it is a subgraph of the De Bruijn graph over $\Sigma^q$ with a few augmentations to give it some useful properties. We will show that its size is linear in the number of distinct q-grams in T, and we give a randomized algorithm to construct the graph in linear time in N.

A node in the graph represents a distinct (q - 1)-gram, and the label on the node is the fingerprint of the respective (q - 1)-gram. The graph has a special node that represents the first (q - 1)-gram of T and which we will denote the start node. Let x and y be characters and $\alpha$ a string such that $|\alpha| = q - 2$. There is an edge between two nodes with labels $\phi(x\alpha)$ and $\phi(\alpha y)$ if $x\alpha y$ is a substring of T. The graph may contain self-loops. Each edge has a label and a counter. The label of the edge $(\phi(x\alpha), \phi(\alpha y))$ is y, and its counter indicates the number of times the substring $x\alpha y$ occurs in T. Since $|x\alpha y| = q$ this data structure contains information about the frequencies of q-grams in T.

Lemma 5 The q-gram graph of T, $G_q(T)$, has $O(k_{T,q})$ nodes and $O(k_{T,q})$ edges.

Proof. Each node represents a distinct (q - 1)-gram, and because of the way we construct the graph, its outgoing edges have unique labels. The combination of a node and an outgoing edge thus represents a distinct q-gram, and therefore there can be at most $k_{T,q}$ edges in the graph. For each new q-gram the algorithm adds an edge from an existing node to a new node, so the graph is connected. Therefore, it has at most $k_{T,q} + 1$ nodes. □

The graph can be constructed using the following online algorithm which takes a string T, an integer q, and a fingerprint function $\phi$ as input. Let the start node of the graph have the fingerprint $\phi(T[0 : (q - 1) - 1])$. Assume that we have built the graph $G_q(T[0 : k + (q - 1) - 1])$ and that we keep its nodes and edges in two dictionaries implemented using hashing. We then compute the fingerprint $\phi(T[k + 1 : k + (q - 1)])$ for the (q - 1)-gram starting in position k + 1 in T. Recall that since this is the next successive (q - 1)-gram, this computation takes constant time. If a node with label $\phi(T[k + 1 : k + (q - 1)])$ already exists we check if there is an edge from $\phi(T[k : k + (q - 1) - 1])$ to $\phi(T[k + 1 : k + (q - 1)])$. If such an edge exists we increment its counter by one. If it does not exist we create it and set its counter to 1. If a node with label $\phi(T[k + 1 : k + (q - 1)])$ does not exist we create it along with an edge from $\phi(T[k : k + (q - 1) - 1])$ to it.

Lemma 6 For a string T of length N, the algorithm is a Monte Carlo-type randomized algorithm that builds the q-gram graph $G_q(T)$ in $O(N)$ expected time.
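
The online construction can be sketched in Python as follows for an uncompressed string. Python dictionaries and sets play the role of the two hashed dictionaries, node labels are fingerprints of (q - 1)-grams, and each edge stores its character label and counter. The fixed prime `P`, the use of `ord`, and the function names are assumptions of the sketch, and q is assumed to be at least 2.

```python
P = 2**61 - 1  # a Mersenne prime standing in for the prime p of the text


def fingerprint(s, b, p):
    """phi(s) = sum over k = 1..|s| of s[k] * b^k mod p (s[k] is 1-indexed)."""
    return sum(ord(ch) * pow(b, k, p) for k, ch in enumerate(s, start=1)) % p


def build_qgram_graph(t, q, b, p=P):
    """Return (start, nodes, edges) of the q-gram graph of t.

    nodes is a set of fingerprints of (q-1)-grams; edges maps a pair
    (source fingerprint, target fingerprint) to [label, counter].
    """
    m = q - 1
    assert m >= 1 and len(t) >= m
    b_inv, b_m = pow(b, -1, p), pow(b, m, p)
    f = fingerprint(t[:m], b, p)
    start, nodes, edges = f, {f}, {}
    for k in range(len(t) - m):
        y = t[k + m]
        # Constant-time update to the fingerprint of the next (q-1)-gram.
        g = ((f - ord(t[k]) * b) * b_inv + ord(y) * b_m) % p
        nodes.add(g)                         # created only if not yet present
        edge = edges.setdefault((f, g), [y, 0])
        edge[1] += 1                         # one more occurrence of this q-gram
        f = g
    return start, nodes, edges
```

For the string ababbbab and q = 3 the graph has one node per distinct 2-gram (ab, ba, bb) and one edge per distinct 3-gram, provided the chosen base causes no collision.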

4 Algorithm 

Our main algorithm is comprised of four steps: preparing the SLP, constructing the q-gram graph from the SLP, turning it into a CS-tree, and computing the suffix tree of the CS-tree. Ultimately the algorithm produces a suffix tree containing the reversed q-grams of T, so to answer a query for a q-gram s we will have to look up $s^R$ in the suffix tree. Below we will describe the algorithm and we will show that it runs in $O(qn)$ expected time while using $O(n + q + k_{T,q})$ space; an improvement over the best known algorithm in terms of space usage. The catch is that a frequency query to the resulting data structure may yield incorrect results due to randomization. However, we show how to turn the algorithm from a Monte Carlo to a Las Vegas-type randomized algorithm with constant overhead. Finally, we show that by decompressing substrings of T in a specific order, we can construct the q-gram graph by decompressing exactly the same number of characters as decompressed by the best known algorithm.

The algorithm is as follows. Figure 2 shows an example of the data structures after each step of the algorithm.

Preprocessing. As the first step of our algorithm we preprocess the SLP such that we know the size of the string derived from a production rule, $|X_i|$, and the number of occurrences in the derivation tree, $occ(X_i)$. We also prepare the SLP for linear time prefix and suffix decompressions using Lemma 4.

Computing the q-gram graph. In this step we construct the q-gram graph $G_q(T)$ from the SLP S. Initially we choose a suitable fingerprint function for the q-gram graph construction algorithm and proceed as follows. For each production rule $X_i = X_l X_r$ in S, such that $|X_i| \ge q$, we decompress its relevant substring $r_{X_i}$. Recall from the definition of relevant substrings that $r_{X_i}$ is the concatenation of the q - 1 length suffix of $t_{X_l}$ and the q - 1 length prefix of $t_{X_r}$. If $|X_l| \le q - 1$ we decompress the entire string $t_{X_l}$, and similarly for $t_{X_r}$. Given $r_{X_i}$ we compute the fingerprint of the first (q - 1)-gram, $\phi(r_{X_i}[0 : (q - 1) - 1])$, and find the node in $G_q(T)$ with this fingerprint as its label. The node is created if it does not exist. Now the construction of $G_q(T)$ can continue from this node, albeit with the following change to the construction algorithm. When incrementing the counter of an edge we increment it by $occ(X_i)$ instead of 1.

The q-gram graph now contains the information needed for the q-gram profile; namely the frequencies of the q-grams in T. The purpose of the next two steps is to restructure the graph to a data structure that supports frequency queries in $O(q)$ time.

Transforming the q-gram graph to a CS-tree. The CS-tree that we want to create is basically the depth-first tree of $G_q(T)$ with the extension that all edges in $G_q(T)$ are also in the tree. We create it as follows. Let the start node of $G_q(T)$ be the node whose label matches the fingerprint of the first q - 1 characters of T. Do a depth-first traversal of $G_q(T)$ starting from the start node. For a previously unvisited node, create a node in the CS-tree with an incoming edge from its predecessor. When reaching a previously visited node, create a new leaf in the CS-tree with an incoming edge from its predecessor. Labels on nodes and edges are copied from their corresponding labels in $G_q(T)$. We now create a path of length q - 1 with the first q - 1 characters of T as labels on its edges. We set the last node on this path to be the root of the depth-first tree. The first node on the path is the root of the final CS-tree.
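
The transformation can be sketched in Python on a small plain-text example. To keep the sketch short it labels graph nodes with the (q - 1)-grams themselves instead of their fingerprints, which corresponds to assuming a collision-free fingerprint function, and it builds the graph from an uncompressed string. The tuple layout `(parent index, edge label, counter)` for tree nodes and all function names are assumptions of the illustration; q is assumed to be at least 2 and at most the length of the string.

```python
from collections import Counter


def qgram_graph(t, q):
    """Adjacency map: (q-1)-gram -> {next character: counter}."""
    graph = {}
    for k in range(len(t) - q + 1):
        node, y = t[k:k + q - 1], t[k + q - 1]
        out = graph.setdefault(node, {})
        out[y] = out.get(y, 0) + 1
    graph.setdefault(t[len(t) - q + 1:], {})  # last (q-1)-gram may have no out-edge
    return graph


def cs_tree(t, q):
    """Return the CS-tree as a list of (parent index, edge label, counter)."""
    graph = qgram_graph(t, q)
    tree = [(None, None, 0)]                  # root of the final CS-tree
    for ch in t[:q - 1]:                      # path spelling the first q-1 characters
        tree.append((len(tree) - 1, ch, 0))
    start = t[:q - 1]
    visited = {start}

    def visit(node, idx):
        # Every graph edge becomes exactly one tree edge; edges into already
        # visited nodes end in a fresh leaf instead of being followed.
        for y, count in graph[node].items():
            tree.append((idx, y, count))
            child = len(tree) - 1
            target = node[1:] + y
            if target not in visited:
                visited.add(target)
                visit(target, child)

    visit(start, len(tree) - 1)
    return tree


def profile_from_tree(tree, q):
    """Read q labels upward from every node; the reversed labels are a q-gram."""
    profile = Counter()
    for idx in range(len(tree)):
        chars, node = [], idx
        while len(chars) < q and tree[node][0] is not None:
            chars.append(tree[node][1])
            node = tree[node][0]
        if len(chars) == q:
            profile["".join(reversed(chars))] += tree[idx][2]
    return profile
```

Reading the tree from a node toward the root spells a reversed q-gram, which is why queries against the final suffix tree use the reversed query string. The recursive traversal is only suitable for small graphs because of Python's recursion limit.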

Computing the suffix tree of the CS-tree. Recall that a suffix in the CS-tree starts in a node 
and ends in the root of the tree. Usually we store a pointer from a leaf in the suffix tree to the 
node in the CS-tree from which the particular suffix starts. However, when we construct the suffix 
tree, we store the value of the counter of the first edge in the suffix as well as the label of the first 
node on the path of the suffix. 



[Figure 2 panels: (a) SLP. (b) The 3-gram graph. (c) CS-tree. (d) Suffix tree.]

Figure 2: An SLP compressing the string ababbbab, and the data structures after each step of the algorithm executed with q = 3.



4.1 Correctness 

Before showing that our algorithm is correct, we will prove some crucial properties of the q-gram graph, the CS-tree, and the suffix tree of the CS-tree subsequent to their construction in the algorithm.

Lemma 7 The q-gram graph $G_q(T)$ constructed from the SLP is connected.

Proof. Consider a production rule $X_i = X_l X_r$. If $|X_i| \le 2(q - 1)$ we decompress the entire string $t_{X_i}$ and insert it into the q-gram graph, and we know that $G_q(t_{X_i})$ is connected. Assume that $G_q(t_{X_l})$ and $G_q(t_{X_r})$ are both connected. We know from Lemma 3 that if we insert all the relevant substrings of the nodes reachable from $X_i$ (including $X_i$) into the graph, then it will contain all (q - 1)-grams of $t_{X_i}$. Since the first q - 1 characters of $r_{X_i}$ are a suffix of $t_{X_l}$, the subgraphs $G_q(t_{X_l})$ and $G_q(r_{X_i})$ will have at least one node in common, and similarly for $G_q(t_{X_r})$ and $G_q(r_{X_i})$. Therefore, $G_q(t_{X_i})$ is connected. □



Lemma 8 Assuming that we are given a fingerprint function $\phi$ that is collision free for substrings of length q - 1 in T, then the extended CS-tree contains each distinct q-gram in T exactly once.

Proof. Let v be a node with an outgoing edge e in $G_q(T)$. The combination of the label of v followed by the character on e is a distinct q-gram and occurs only once in $G_q(T)$ due to the way we construct it. There may be several paths of length q - 1 ending in v spelling the same string s, and because the fingerprint function is deterministic, there cannot be a path spelling s ending in some other node. Since the depth-first traversal of $G_q(T)$ only visits e once, the resulting CS-tree will only contain the combination of the labels on v and e once. □

Lemma 9 Assuming that we are given a fingerprint function $\phi$ that is collision free for substrings of length q - 1 in T, then any node v in the suffix tree of the CS-tree with $sd(v) \ge q$ is a leaf.

Proof. Each suffix of length $\ge q$ in the CS-tree has a distinct q length prefix (Lemma 8), so each node in the suffix tree with string depth $\ge q$ is a leaf. □

We have now established the necessary properties to prove that our algorithm is correct. 

Lemma 10 Assuming that we are given a fingerprint function $\phi$ that is collision free on all substrings of length q - 1 of T, our algorithm correctly computes a q-gram profile for T.

Proof. Our algorithm inserts each relevant substring $r_{X_i}$ exactly once, and if a q-gram s occurs $socc(s, r_{X_i})$ times in $r_{X_i}$, the counter on the edge representing s is incremented by exactly $socc(s, r_{X_i}) \cdot occ(X_i)$. From Lemma 3 we then know that when $G_q(T)$ is fully constructed, the counters on its edges correspond to the frequencies of the q-grams in T. Since $G_q(T)$ is connected (Lemma 7) the modified depth-first traversal will correctly produce a CS-tree, and it contains each q-gram from $G_q(T)$ exactly once (Lemma 8). Finally, we know from Lemma 9 that a node v with $sd(v) \ge q$ in the suffix tree is a leaf, so searching for a string of length q in the suffix tree will yield a unique result and can be done in $O(q)$ time. □



4.2 Analysis 

Theorem 2 The algorithm runs in $O(qn)$ expected time and uses $O(n + q + k_{T,q})$ space.

Proof. Let $S_q = \{X_i \mid X_i \in S \text{ and } |X_i| \ge q\}$ be the set of production rules that have a relevant substring. For each production rule $X_i = X_l X_r \in S_q$ we decompress its relevant substring of size $|r_{X_i}|$ and insert it into the q-gram graph. Since $r_{X_i}$ is comprised of the suffix of $t_{X_l}$ and the prefix of $t_{X_r}$ we know from Lemma 4 that $r_{X_i}$ can be decompressed in $O(|r_{X_i}|)$ time. Inserting $r_{X_i}$ into the q-gram graph can be done in $O(|r_{X_i}|)$ expected time (Lemma 6). Since $|S_q| = O(n)$ and $q \le |r_{X_i}| \le 2(q - 1)$ this step of the algorithm takes $O(qn)$ time. When transforming the q-gram graph to a CS-tree we do one traversal of the graph and add q - 1 nodes, so this step takes $O(q + k_{T,q})$ time. Constructing the suffix tree takes expected linear time in the size of the CS-tree if we hash the characters of the alphabet to a polynomial range first (Lemma 1). Finally, observe that since our algorithm is correct, it detects all q-grams in T and therefore there can be at most $k_{T,q} \le \sum_{X_i \in S_q} |r_{X_i}| = O(qn)$ distinct q-grams in T. Thus, the expected running time of our algorithm is $O(qn)$.

In the preprocessing step of our algorithm we use $O(n)$ space to store the size of the derived substrings and the number of occurrences in the derivation tree as well as the data structure needed for linear time prefix and suffix decompressions (Lemma 4). The space used by the q-gram graph is $O(k_{T,q})$, and when transforming it to a CS-tree we add at most one new node per edge in the graph and extend it by q - 1 nodes and edges. Thus, its size is $O(q + k_{T,q})$. The CS-tree contains $O(q + k_{T,q})$ suffixes, so the size of the suffix tree is $O(q + k_{T,q})$. In total our algorithm uses $O(n + q + k_{T,q})$ space. □



4.3 Verifying the fingerprint function 

Until now we have assumed that the fingerprints used as labels for the nodes in the q-gram graph are collision free. In this section we describe an algorithm that uses the suffix tree produced by our algorithm to verify whether the chosen fingerprint function is collision free.

If there is a collision among fingerprints, the q-gram graph construction algorithm will add an 
edge such that there are two paths of length q — 1 ending in the same node while spelling two 
different strings. This observation is formalized in the next Lemma. 

Lemma 11 For each node v in $G_q(T)$, if every path of length q - 1 ending in v spells the same string, then the fingerprint function used to construct $G_q(T)$ is collision free for all (q - 1)-grams in T.

Proof. From the q-gram graph construction algorithm we know that we create a path of characters in the same order as we read them from T. This means that every path of length q - 1 ending in a node v represents the q - 1 characters generating the fingerprint stored in v, regardless of what comes before those q - 1 characters. If all the paths of length q - 1 ending in v spell the same string s, then we know that there is no other substring $s' \ne s$ of length q - 1 in T that yields the fingerprint $\phi(s)$. □

It is not straightforward to check Lemma 11 directly on the q-gram graph without using too much time. However, the error introduced by a collision naturally propagates to the CS-tree and the suffix tree of the CS-tree, and as we shall now see, the suffix tree offers a clever way to check for collisions. First, recall that in a leaf v in the suffix tree, we store the fingerprint of the reversed prefix of length q - 1 of the suffix ending in v. Now consider the following property of the suffix tree.

Lemma 12 Let $v_\phi$ be the fingerprint stored in a leaf v in the suffix tree. The fingerprint function $\phi$ is collision free for (q - 1)-grams in T if $v_\phi \ne u_\phi$ or $sd(nca(v, u)) \ge q - 1$ for all pairs v, u of leaves in the suffix tree.

Proof. Consider the contrapositive statement: If $\phi$ is not collision free on T then there exists some pair v, u for which $v_\phi = u_\phi$ and $sd(nca(v, u)) < q - 1$. Assume that there is a collision. Then at least two paths of length q - 1 spelling different strings end in the same node in $G_q(T)$. Regardless of the order of the nodes in the depth-first traversal of $G_q(T)$, the CS-tree will have two paths of length q - 1 spelling different strings and yet starting in nodes storing the same fingerprint. Therefore, the suffix tree contains two suffixes that differ by at least one character in their q - 1 length prefix while ending in leaves storing the same fingerprint, which is what we want to show. □

Checking if there exists a pair of leaves where $v_\phi = u_\phi$ and $sd(nca(v, u)) < q - 1$ is straightforward. For each leaf we store a pointer to its ancestor w that satisfies $sd(w) \ge q - 1$ and $sd(parent(w)) < q - 1$. Then we visit each leaf v again and store $v_\phi$ in a dictionary along with the ancestor pointer just defined. If the dictionary already contains $v_\phi$ and the ancestor pointer points to a different node, then it means that $v_\phi = u_\phi$ and $sd(nca(v, u)) < q - 1$ for some two leaves.

The algorithm does two passes of the suffix tree which has size $O(q + k_{T,q})$. Using a hashing scheme for the dictionary we obtain an algorithm that runs in $O(q + k_{T,q})$ expected time.
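
The dictionary pass can be sketched in Python as follows. The sketch assumes the first pass has already been done and that each leaf is given as a pair of its stored fingerprint and an identifier of its ancestor pointer; how those pairs are extracted from a suffix tree is outside the sketch. The name `has_collision` and the use of short strings as ancestor identifiers in the usage lines are purely illustrative.

```python
def has_collision(leaves):
    """Return True if two leaves share a fingerprint but not an ancestor.

    leaves is an iterable of (fingerprint, ancestor_id) pairs, where
    ancestor_id identifies the ancestor w of the leaf with
    sd(w) >= q - 1 and sd(parent(w)) < q - 1.
    """
    seen = {}  # fingerprint -> ancestor identifier seen first
    for fp, ancestor in leaves:
        # setdefault stores the ancestor on first sight and otherwise
        # returns the one recorded earlier for the same fingerprint.
        if seen.setdefault(fp, ancestor) != ancestor:
            return True
    return False


if __name__ == "__main__":
    # Equal fingerprints under the same ancestor are fine.
    print(has_collision([(17, "w1"), (42, "w2"), (17, "w1")]))
    # The same fingerprint under two different ancestors signals a collision.
    print(has_collision([(17, "w1"), (42, "w2"), (17, "w3")]))
```

With a hash-based dictionary each leaf is handled in expected constant time, so the pass is linear in the number of leaves in expectation.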

4.4 Eliminating redundant decompressions 

We now present an alternative approach to constructing the q-gram graph from the SLP. The 
resulting algorithm decompresses fewer characters. 

In our first algorithm for constructing the q-gram graph we did not specify in which order to insert the relevant substrings into the graph. For that reason we do not know from which node to resume construction of the graph when inserting a new relevant substring. So to determine the node to continue from, we need to compute the fingerprint of the first (q - 1)-gram of each relevant substring. In other words, the relevant substrings are overlapping, and consequently some characters are decompressed more than once. Our improved algorithm is based on the following observation. Consider a production rule $X_i = X_l X_r$. If all relevant substrings of production rules reachable from $X_l$ (including $r_{X_l}$) have been inserted into the graph, then we know that all q-grams in $t_{X_l}$ are in the graph. Since the q - 1 length prefix of $r_{X_i}$ is also a suffix of $t_{X_l}$, we know that a node with the label $\phi(r_{X_i}[0 : (q - 1) - 1])$ is already in the graph. Hence, after inserting all relevant substrings of production rules reachable from $X_l$ we can proceed to insert $r_{X_i}$ without having to decompress $r_{X_i}[0 : (q - 1) - 1]$.

Algorithm. First we compute and store the size of the relevant substring $|r_{X_i}| = \min(q - 1, |X_l|) + \min(q - 1, |X_r|)$ for each production rule $X_i = X_l X_r$ in the subset $S_q = \{X_i \mid X_i \in S \text{ and } |X_i| \ge q\}$ of the production rules in the SLP. We maintain a linked list L with a pointer to its head and tail, denoted by $head(L)$ and $tail(L)$. The list initially contains the leftmost node in $S_q$, say $X_s$, from the root of the SLP and the sentinel value $|r_{X_s}|$. We now start decompressing T by traversing the SLP depth-first, left-to-right. When following a pointer from $X_i$ to a right child, and $X_i \in S_q$, we add $X_i$ and the sentinel value $|r_{X_i}| - (q - 1)$ to the back of L. As characters are decompressed they are fed to the q-gram graph construction algorithm, and when a counter on an edge in $G_q(T)$ is incremented, we increment it by $occ(head(L))$. For each character we decompress, we decrement the sentinel value for $head(L)$, and if this value becomes 0 we remove the head of the list and set $head(L)$ to be the next production rule in the list. When leaving a node $X_i$ we mark it as visited and store a pointer from $X_i$ to the node with label $\phi(t_{X_i}[|X_i| - (q - 1) : |X_i| - 1])$ in $G_q(T)$. If we encounter a node that has already been visited, we decompress its q - 1 length prefix using the data structure of Lemma 4, set the node with label $\phi(t_{X_i}[|X_i| - (q - 1) : |X_i| - 1])$ to be the node from where construction of the q-gram graph should continue, and do not proceed to visit its children nor add it to L.

Analysis. Assume without loss of generality that the algorithm is at a production rule deriving the string $t_{X_i} = t_{X_l} t_{X_r}$ and all q-grams in $t_{X_l}$ are in $G_q(T)$. Since we start by decompressing the leftmost production rule in $S_q$ there is always such a rule. We then decompress $|r_{X_i}| - (q - 1)$ characters before $X_i$ is removed from the list. We only add a production once to the list, so the total number of characters decompressed is $(q - 1) + \sum_{X_i \in S_q} (|r_{X_i}| - (q - 1)) = O(N - \alpha)$, and we hereby obtain our result from Theorem 1. This is fewer characters than our first algorithm, which requires $\sum_{X_i \in S_q} |r_{X_i}|$ characters to be decompressed. Furthermore, it is exactly the same number of characters decompressed by the fastest known algorithm due to Goto et al. [6].
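
The two character counts in this analysis can be evaluated on a concrete grammar with a short Python sketch. It only evaluates the two sums and does not run the traversal itself. The list encoding of the SLP, the function name `decompression_counts`, and the rules in `EXAMPLE` (a hypothetical SLP for ababbbab) are assumptions made for the illustration.

```python
def decompression_counts(rules, q):
    """Return (first, improved): the sum of |r_Xi| over S_q, and
    (q - 1) plus the sum of |r_Xi| - (q - 1) over S_q."""
    length = []
    for rule in rules:  # children precede their parents in the list
        if isinstance(rule, str):
            length.append(1)
        else:
            length.append(length[rule[0]] + length[rule[1]])

    # |r_Xi| = min(q - 1, |X_l|) + min(q - 1, |X_r|) for every rule in S_q.
    sizes = [
        min(q - 1, length[rule[0]]) + min(q - 1, length[rule[1]])
        for i, rule in enumerate(rules)
        if not isinstance(rule, str) and length[i] >= q
    ]
    first = sum(sizes)
    improved = (q - 1) + sum(size - (q - 1) for size in sizes)
    return first, improved


# Hypothetical SLP for "ababbbab" (zero-based indices, start rule last).
EXAMPLE = ["a", "b", (0, 1), (1, 1), (2, 2), (3, 2), (4, 5)]

if __name__ == "__main__":
    print(decompression_counts(EXAMPLE, 3))
```

For `EXAMPLE` and q = 3 three rules have a relevant substring, each of size 4, so the first algorithm decompresses 12 characters while the improved count is 2 + 3 · 2 = 8.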

References 

[1] S. Burkhardt, A. Crauser, P. Ferragina, H.-P. Lenhof, E. Rivals, and M. Vingron. q-gram based database searching using a suffix array (QUASAR). In Proc. 3rd RECOMB, pages 77-83, 1999.

[2] M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, and A. Shelat. The smallest grammar problem. IEEE Trans. Inf. Theory, 51(7):2554-2576, 2005.

[3] M. Farach. Optimal suffix tree construction with large alphabets. In Proc. 38th FOCS, pages 137-143, 1997.

[4] T. Gärtner. A survey of kernels for structured data. ACM SIGKDD Explorations Newsletter, 5(1):49-58, 2003.

[5] L. Gąsieniec, R. Kolpakov, I. Potapov, and P. Sant. Real-time traversal in grammar-based compressed files. In Proc. 15th DCC, page 458, 2005.

[6] K. Goto, H. Bannai, S. Inenaga, and M. Takeda. Speeding up q-gram mining on grammar-based compressed texts. In Proc. 23rd CPM, pages 220-231, 2012.

[7] K. Goto, H. Bannai, S. Inenaga, and M. Takeda. Fast q-gram mining on SLP compressed strings. J. Discrete Algorithms, 18:89-99, 2013.

[8] P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In Proc. 16th MFCS, pages 240-248, 1991.

[9] J. Kärkkäinen and E. Sutinen. Lempel-Ziv index for q-grams. Algorithmica, 21(1):137-154, 1998.

[10] R. M. Karp and M. O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249-260, 1987.

[11] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proc. PSB, volume 7, pages 566-575, 2002.

[12] W. Matsubara, S. Inenaga, A. Ishino, A. Shinohara, T. Nakamura, and K. Hashimoto. Efficient algorithms to compute compressed longest common substrings and compressed palindromes. Theoret. Comput. Sci., 410(8):900-913, 2009.

[13] G. Paaß, E. Leopold, M. Larson, J. Kindermann, and S. Eickeler. SVM classification using sequences of phonemes and syllables. In Proc. 6th PKDD, pages 373-384, 2002.

[14] W. Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoret. Comput. Sci., 302(1):211-222, 2003.

[15] T. Shibuya. Constructing the suffix tree of a tree with a large alphabet. IEICE Trans. Fundamentals, 86(5):1061-1066, 2003.

[16] E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In Proc. 3rd ESA, pages 327-340, 1995.

[17] E. Sutinen and J. Tarhio. Filtration with q-samples in approximate string matching. In Proc. 7th CPM, pages 50-63, 1996.

[18] T. Takaoka. Approximate pattern matching with samples. In Proc. 5th ISAAC, pages 234-242, 1994.

[19] E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoret. Comput. Sci., 92(1):191-211, 1992.

[20] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337-343, 1977.

[21] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory, 24(5):530-536, 1978.

